ABSTRACT

Evolutionary Learning proceeds by evolving a population of
classifiers, from which it generally returns (with some
notable exceptions) the single best-of-run classifier as the
final result. Meanwhile, Ensemble Learning, one of the most
efficient approaches in supervised Machine Learning over the
last decade, proceeds by building a population of diverse
classifiers. Ensemble Learning with Evolutionary Computation
has thus received increasing attention. The Evolutionary
Ensemble Learning (EEL) approach presented in this paper
features two contributions. First, a new fitness function,
inspired by co-evolution and enforcing classifier diversity,
is presented. Second, a new selection criterion based on
the classification margin is proposed. This criterion is used
to extract the classifier ensemble either from the final
population only (Off-EEL) or incrementally along evolution
(On-EEL). Experiments on a set of benchmark problems show
that Off-EEL outperforms single-hypothesis evolutionary
learning and state-of-the-art Boosting, and generates smaller
classifier ensembles.

Categories and Subject Descriptors 

I.5.2 [Pattern Recognition]: Design Methodology - Classifier
design and evaluation; I.2.8 [Artificial Intelligence]:
Problem Solving, Control Methods, and Search - Heuristic
methods

General Terms 

Algorithms 
1. INTRODUCTION

Ensemble Learning, one of the main advances in Supervised
Machine Learning since the early 90's, relies on: i) a
weak learner (extracting hypotheses, aka classifiers, with
error probability less than $1/2 - \epsilon$, $\epsilon > 0$); ii) a
diversification heuristic used to extract sufficiently diverse
classifiers; iii)
a voting mechanism, aggregating the diverse classifiers con- 
structed [U|5]. If the classifiers are sufficiently diverse and 
their errors are independent, then their majority vote will 
reach an arbitrarily low error rate on the training set as the 
number of classifiers increases [6]. Therefore, up to some re- 
strictions on the classifier space [25], the generalization error 
will also be low 1 . 
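The majority-vote argument can be made concrete with a small calculation. The sketch below computes the exact error of a majority vote under idealized assumptions that are ours rather than the paper's experimental setting: binary classification, an odd number of voters, mutually independent errors, and one shared error rate 1/2 - ε. The function name `majority_vote_error` is hypothetical.

```python
from math import comb


def majority_vote_error(n_classifiers: int, error_rate: float) -> float:
    """Probability that a majority of n independent binary classifiers err.

    Assumes every classifier errs independently with the same probability.
    """
    if n_classifiers % 2 == 0:
        raise ValueError("use an odd number of classifiers to avoid ties")
    # Smallest number of wrong votes that makes the majority wrong.
    k_min = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k)
        * error_rate ** k
        * (1.0 - error_rate) ** (n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )


if __name__ == "__main__":
    # Weak learners with error 0.4 (epsilon = 0.1): the vote improves with n.
    for n in (1, 11, 101, 1001):
        print(n, majority_vote_error(n, 0.4))
```

Under these assumptions the error shrinks steadily as classifiers are added. Real classifiers trained on the same data are correlated, which is why diversity has to be enforced explicitly.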

The most innovative aspect of Ensemble Learning w.r.t. 
the Machine Learning literature concerns the diversity re- 
quirement, implemented through parallel or sequential heu- 
ristics. In Bagging, diversity is enforced by considering inde- 
pendent sub-samples of the training set, and/or using differ- 
ent learning parameters pfj. Boosting iteratively constructs 
a sequence of classifiers, where each classifier focuses on the 
examples misclassified by the previous ones [8]. 

Diversity is also a key feature of Evolutionary Computa- 
tion (EC): in contrast with all other stochastic optimization 
approaches, evolutionary algorithms proceed by evolving a 
population of solutions, and the diversity thereof has been 
stressed as a key factor of success since the beginnings of 
EC. Deep similarities between Ensemble Learning and EC 
thus appear; in both cases, diversity is used to escape from 
local minima, where any single "best" solution is only too 
easily trapped. Despite this similarity, Evolutionary Learn- 
ing has most often (with some notable exceptions, see [141 
ITUl [T5] among others) focused on single-hypothesis learning, 
where some single best-of-run hypothesis is returned as the 
solution. 

However, the evolutionary population itself could be used
as a pool for recruiting the elements of an ensemble,
enabling "Ensemble Learning for Free". Previous work along
this line will be described in Section 2; it is mostly based
on using an evolutionary algorithm as weak learner [17], or
on using evolutionary diversity-enforcing heuristics [16] [18].

1 In practice, the generalization error is estimated from the
error on a test set, disjoint from the training set. The reader
is referred to [4] for a comprehensive discussion about the
comparative evaluation of learning algorithms.

In this paper, the "Evolutionary Ensemble Learning For
Free" claim is empirically examined along two directions.
The first direction is that of the classifier diversity; a new
learning-oriented fitness function is proposed, inspired by
the co-evolution framework [13] and generalizing the
diversity-enforcing fitness proposed by [18]. The second
direction is that of the selection of the ensemble classifiers
within the evolutionary population(s). Selecting the best
classifiers in a pool amounts to a feature selection problem,
that is, a combinatorial optimization problem [12]. A greedy
set-covering approach is used, built on a margin-based
criterion inspired by Schapire et al. [23]. Finally, the paper
presents two Evolutionary Ensemble Learning (EEL) approaches,
called Off-EEL and On-EEL, respectively tackling the selection
of the ensemble classifiers in the final population, or along
evolution.

The paper is structured as follows. Section 2 reviews and
discusses some work relevant to Evolutionary Ensemble
Learning. Section 3 describes the two proposed approaches,
Off-EEL and On-EEL, introducing the specific fitness function
and the ensemble classifier selection procedure. Experimental
results based on benchmark problems from the UCI
repository are reported in Section 4. The paper concludes
with some perspectives for further research, discussing the
priorities for a tight coupling of Ensemble Learning with
Evolutionary Optimization in terms of dynamic systems [22].

2. RELATED WORK 

Interestingly, some early approaches in Evolutionary
Learning were rooted in Ensemble Learning ideas 2 . The
Michigan approach [14] evolves a population made of rules,
whereas the Pittsburgh approach evolves a population made of
sets of rules. What is gained in flexibility and tractability
in the Michigan approach is compensated by the difficulty of
assessing a single rule, for the following reason. A rule
usually only covers a part of the example space; gathering
the best rules (e.g. the rules with highest accuracy) does not
result in the best ruleset. Designing an efficient fitness
function, such that a good quality ruleset could be extracted
from the final population, was found to be a tricky task.

In the last decade, Ensemble Learning has been explored
within Evolutionary Learning, chiefly in the context of
Genetic Programming (GP). A first trend directly inspired
by Bagging and Boosting aims at reducing the fitness
computation cost [7] [16] and/or dealing with datasets which
do not fit in memory [24]. For instance, Iba [16] divided the
GP population into several sub-populations which are evaluated
on subsets of the training set. Folino et al. [7] likewise
sampled the training set in a Bagging-like mode in the context
of parallel cellular GP. Song et al. [24] used Boosting-like
heuristics to deal with training sets that do not fit in
memory; the training set is divided into folds, one of which
is loaded in memory and periodically replaced; at each
generation, small subsets are selected from the current fold
to compute the fitness function, where the selection is
based on a mixture of uniform and Boosting-like distributions.

2 Learning Classifier Systems (LCS, [14] [15]) are mostly
devoted to Reinforcement Learning, as opposed to Supervised
Machine Learning; therefore they will not be considered in
the paper.

The use of Evolutionary Algorithms as weak learners with- 
in a standard Bagging or Boosting approach has also been 
investigated. Boosting approaches for GP have been applied 
for instance to classification [21] or symbolic regression [17]:
each run delivers a GP tree minimizing the weighted sum
of the training errors, and the weights are computed as in
standard Boosting [5]. While such ensembles of GP trees 
result, as expected, in a much lower variance of the perfor- 
mance, they do not fully exploit the population-based nature 
of GP, as independent runs are launched to learn successive 
classifiers. 

Liu et al. [18] proposed a tight coupling between
Evolutionary Algorithms and Ensemble Learning. They
constructed an ensemble of Neural Networks, using a modified
back-propagation algorithm to enforce the diversity of the 
networks; specifically, the back-propagation aims at both 
minimizing the training error and maximizing the negative 
correlation of the current network with respect to the current 
population. Further, the fitness associated to each network 
is the sum of the weights of all examples it correctly classi- 
fies, where the weight of each example is inversely propor- 
tional to the number of classifiers that correctly classify this 
example. While this approach nicely suggests that ensemble 
learning is a Multiple Objective Optimization (MOO) prob- 
lem (minimize the error rate and maximize the diversity), 
it classically handles the MOO problem as a fixed weighted 
sum of the objectives. 

The MOO perspective was further investigated by Chandra
and Yao in the DIVACE system, a highly sophisticated
system for the multi-level evolution of ensembles of
classifiers [2] [3]. In [3], the top-level evolution
simultaneously minimizes the error rate (accuracy) and
maximizes the negative correlation (diversity). In [2], the
negative correlation-inspired criterion is replaced by a
pairwise failure crediting; the difference concerns the
misclassification of examples that are correctly classified
by other classifiers. Finally, the ensemble is constructed
either by keeping all classifiers in the final population, or
by clustering the final population (based on their phenotypic
distance) and selecting a classifier in each cluster.

While the MOO perspective nicely captures the interplay 
of the accuracy and diversity goals within Ensemble Learn- 
ing, the selection of the classifiers in the genetic pool as 
done in [2] [3] does not fully exploit the possibilities of
evolutionary optimization, in two respects. On the one hand, it
only considers the final population, which usually involves up
to a few hundred classifiers, while learning ensembles
commonly involve some thousands of classifiers. On the other hand,
clustering-based selection proceeds on the basis of the phe- 
notypic distance between classifiers, considering again that 
all examples are equally important, while the higher stress 
put on harder examples is considered the source of the better 
Boosting efficiency [5]. 

3. ENSEMBLE LEARNING FOR FREE 

Following the above discussion, Evolutionary Ensemble
Learning (EEL) involves two critical issues: i) how to enforce
both the predictive accuracy and the diversity of the
classifiers in the population, and across generations; ii) how
to best select the ensemble classifiers, from either the final
population only or all along evolution.

Two EEL frameworks have been designed to study these 
interdependent issues. The first one, dubbed Offline
Evolutionary Ensemble Learning (Off-EEL), constructs the
ensemble from the final population only. The second one, called
Online Evolutionary Ensemble Learning (On-EEL), gradu- 
ally constructs the classifier ensemble as a selective archive 
of evolution, where some classifiers are added to the archive 
at each generation. 

Both approaches combine a standard generational
evolutionary algorithm with two interdependent components:
a new diversity-enhancing fitness function, and a selection
mechanism. The fitness function, presented in Section 3.1
and generalizing the fitness devised by Liu et al. [18], is
inspired by co-evolution [13]. The selection process is used
to extract a set of classifiers from either the final population
(Off-EEL) or the current archive plus the current population
(On-EEL), and proceeds by greedily maximizing the
ensemble margin (Section 3.2).

Only binary or multi-class classification problems are con- 
sidered in this paper. The decision of the classifier ensemble 
is the majority vote among the classifiers (ties being arbi- 
trarily broken). 

3.1 Diversity-enforcing Fitness 

Traditionally, Evolutionary Learning maximizes the num- 
ber of correctly classified training examples (or equivalently 
minimizes the error rate). However, examples are not equally 
informative; therefore a rule correctly classifying a hard ex- 
ample (e.g. close to the frontiers of the target concept) is 
more interesting and should be more rewarded than a rule 
correctly classifying an example which is correctly classified 
by almost all rules. 

Co-evolutionary learning, pioneered by Hillis [13],
takes advantage of the above remark, gradually forging
more and more difficult examples to enforce the discovery
of high-quality solutions. Boosting proceeds along the
same lines, gradually putting the stress on the examples 
which have not been successfully predicted so far. 

A main difference between both frameworks is that Boost- 
ing exploits a finite set of labelled examples, while co-evol- 
utionary learning has an infinite supply of labelled examples 
(since it embeds the oracle). A second difference is that the 
difficulty of an example depends on the whole sequence of 
classifiers in Boosting, whereas it only depends on the cur- 
rent classifier population in co-evolution. In other words, 
Boosting is a memory-based process, while co-evolutionary 
learning is a memoryless one. Both approaches thus suf- 
fer from opposite weaknesses. Being a memory-based pro- 
cess, Boosting can be misled by noisy examples; consistently 
misclassified, these examples eventually get heavy weights 
and thus destabilize the Boosting learning process.
Conversely, co-evolution can forget what has been learned
during early stages; specific heuristics, e.g. the so-called
Hall-of-Fame, an archive of best-so-far individuals, are
required to prevent co-evolution from cycling in the learning
landscape [20].

Based on these ideas, the fitness of classifiers is defined in
this work from a set of reference classifiers denoted $Q$. The
hardness of every training example $x$ is measured by the
number of classifiers in $Q$ which misclassify $x$. The fitness
of every classifier $h$ is then measured by the cumulated
hardness of the examples that are correctly classified by $h$.



Three remarks can be made concerning this fitness func- 
tion. Firstly, contrasting with standard co-evolution, there 
is no way classifiers can "unlearn" to classify the training ex- 
amples, since the training set is fixed. Secondly, as in Boost- 
ing, the fitness of a classifier reflects its diversity with respect 
to the reference set. Lastly, the classifier fitness function is 
highly multi-modal compared to the simple error rate: good 
classifiers might correctly classify many easy examples, or 
sufficiently many hard enough examples, or a few very hard 
examples. 

Formally, let $\mathcal{E} = \{(x_i, y_i),\ x_i \in X,\ y_i \in Y,\ i = 1 \ldots n\}$
denote the training set (referred to as the set of fitness cases
in the GP context); each fitness case or example $(x_i, y_i)$ is
composed of an instance $x_i$ belonging to the instance space
$X$ and the associated label $y_i$ belonging to a finite set $Y$.
Any classifier $h$ is a function mapping the instance space $X$
onto $Y$. The loss function $\ell$ is defined as $\ell : Y \times Y \mapsto \mathbb{R}$,
where $\ell(y, y')$ is the (real-valued) error cost of predicting
label $y$ instead of the true label $y'$.

The hardness or weight of every training example $(x_i, y_i)$,
noted $w_i^{Q}$, or $w_i$ when the reference set $Q$ is clear from
the context, is the average loss incurred by the reference
classifiers on $(x_i, y_i)$:

$$w_i = \frac{1}{|Q|} \sum_{h \in Q} \ell(h(x_i), y_i) \qquad (1)$$

The cumulated hardness fitness $\mathcal{F}$ is finally defined as
follows: $\mathcal{F}(h)$ is the sum over all training examples that are
correctly classified by $h$, of their weight $w_i$ raised to power
$\gamma$. Parameter $\gamma$ governs the importance of the weights $w_i$
(the cumulated hardness boils down to the number of correctly
classified examples for $\gamma = 0$) and thus the diversity
pressure.

$$\mathcal{F}(h) = \sum_{i = 1 \ldots n,\ h(x_i) = y_i} w_i^{\gamma} \qquad (2)$$

Parameter $\gamma$ can also be adjusted depending on the level
of noise in the dataset. As noisy examples typically reach
high weights, increasing the value of $\gamma$ might lead to retain
spurious hypotheses, which happen to correctly classify a
few noisy examples. When $\ell$ is set to the step loss function
($\ell(y, y') = 0$ if $y = y'$, 1 otherwise) and $\gamma$ is set to 1, the
above fitness function is the same as the one used by Liu
et al. [18]. The value of $\gamma$ is set to 2 in the experiments
(Section 4).
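The hardness weight and the cumulated hardness fitness defined above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: classifiers are assumed to be plain Python callables, examples are `(instance, label)` pairs, and the names `step_loss`, `hardness_weights` and `cumulated_hardness` are hypothetical.

```python
def step_loss(predicted, true):
    """Step loss: 0 for a correct prediction, 1 otherwise."""
    return 0.0 if predicted == true else 1.0


def hardness_weights(reference, examples, loss=step_loss):
    """Average loss of the reference classifiers on each training example."""
    return [
        sum(loss(h(x), y) for h in reference) / len(reference)
        for x, y in examples
    ]


def cumulated_hardness(h, examples, weights, gamma=2.0):
    """Sum of w_i ** gamma over the examples that h classifies correctly."""
    return sum(
        w ** gamma
        for (x, y), w in zip(examples, weights)
        if h(x) == y
    )


if __name__ == "__main__":
    # Toy 1-D problem: label is 1 when the single feature exceeds 1.5.
    examples = [((0.0,), 0), ((1.0,), 0), ((2.0,), 1), ((3.0,), 1)]
    reference = [
        lambda x: int(x[0] > 0.5),  # wrong on the second example
        lambda x: int(x[0] > 1.5),  # always right
        lambda x: int(x[0] > 2.5),  # wrong on the third example
    ]
    weights = hardness_weights(reference, examples)
    print(weights)
    print(cumulated_hardness(reference[1], examples, weights, gamma=1.0))
```

With `gamma=0` every correctly classified example contributes 1 (Python evaluates `0.0 ** 0` as 1.0), which recovers the plain count of correct classifications mentioned in the text.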

3.2 Ensemble Selection 

As noted earlier on, the selection of classifiers in a pool
$\mathcal{H} = \{h_1, \ldots, h_T\}$ in order to form an efficient ensemble
is formally equivalent to a feature selection problem. The
equivalence is seen by replacing the initial instance space $X$
with the one defined from the classifier pool, where each
instance $x_i$ is redescribed as the vector $(h_1(x_i), \ldots, h_T(x_i))$.
Feature selection algorithms [12] could thus be used for
ensemble selection; unfortunately, feature selection is one of
the most difficult Machine Learning problems.

Therefore, a simple greedy selection process is used in this
paper to select the classifiers in the diverse pools considered
by the Off-EEL (Section 3.3) and On-EEL (Section 3.4)
algorithms. The novelty is the selection criterion, generalizing
the notion of margin [11] [23] to an ensemble of examples as
follows.



Figure 1: Pseudo-code of Ensemble-Selection(classifier pool $\mathcal{H}$, training set $\mathcal{E}$, initial classifier ensemble $\mathcal{C}_0$).

1. Let $t = 1$, and $\mathcal{H}_1$ be the set $\mathcal{H}$ with duplicate individuals removed

2. While $\mathcal{H}_t$ is not empty:

   (a) Let $h_t^* = \arg\max_{h \in \mathcal{H}_t} (\mathcal{C}_{t-1} \cup \{h\})$ after the margin-based order relation of Equation 5

   (b) Let $\mathcal{H}_{t+1} = \mathcal{H}_t \setminus \{h_t^*\}$ (remove $h_t^*$ from $\mathcal{H}_t$)

   (c) Let $\mathcal{C}_t = \mathcal{C}_{t-1} \cup \{h_t^*\}$ (and add it to $\mathcal{C}_t$)

   (d) $t = t + 1$

3. Return $\mathcal{C}^*$, the classifier ensemble in $\{\mathcal{C}_0, \ldots, \mathcal{C}_{t-1}\}$ that achieves the lowest error rate on $\mathcal{E}$, selecting the smallest
   ensemble in case of ties.



Formally, let $\mathcal{C}$ denote the current ensemble, initialized to
the classifier $h^*$ with minimum error rate in $\mathcal{H}$. For each
example $(x_i, y_i)$, let its margin $m_i$ be defined as follows.
Let $y'_i$ be the class most frequently associated to $x_i$ by the
classifiers in $\mathcal{C}$, such that $y'_i$ is different from the true
class $y_i$. Let $c_i$ (respectively $c'_i$) denote the number of
classifiers in $\mathcal{C}$ associating class $y_i$ (resp. $y'_i$) to $x_i$.
Then margin $m_i$ is defined as $c_i - c'_i$. A positive margin thus
denotes the fact that the example is correctly classified by
the majority vote; the higher the margin, the more confident
the ensemble prediction. Conversely, a negative margin denotes
an error: the ensemble misclassifies the example as belonging
to class $y'_i$, and the more negative the margin, the more
classifiers need to be added to the ensemble in order to
correctly classify $x_i$.

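Let $K$ denote the number of classes of the problem and
$|A|$ the size of a set $A$. The above definitions then read:

$$y'_i = \arg\max_{k = 1 \ldots K,\ k \neq y_i} |\{h_j(x_i) = k,\ h_j \in \mathcal{C}\}| \qquad (3)$$

$$m_i = |\{h_j(x_i) = y_i,\ h_j \in \mathcal{C}\}| - |\{h_j(x_i) = y'_i,\ h_j \in \mathcal{C}\}| \qquad (4)$$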
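As a small worked illustration of the margin of a single example, the sketch below counts the votes cast by an ensemble and subtracts the score of the strongest wrong class from the score of the true class. The function name `margin` and the representation of the ensemble output as a flat list of votes are assumptions made for the example.

```python
from collections import Counter


def margin(votes, true_label, classes):
    """Votes for the true class minus votes for the most voted wrong class."""
    counts = Counter(votes)
    # Strongest competitor among the classes different from the true one.
    strongest_wrong = max(
        (counts[k] for k in classes if k != true_label), default=0
    )
    return counts[true_label] - strongest_wrong


if __name__ == "__main__":
    # Four classifiers vote on one example of a 3-class problem.
    print(margin([0, 0, 1, 2], true_label=0, classes=[0, 1, 2]))
    # Three classifiers, two of which are wrong: the margin is negative.
    print(margin([1, 1, 0], true_label=0, classes=[0, 1]))
```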

Initially, the quality of ensemble $\mathcal{C}$ was measured by its
minimum margin when $(x_i, y_i)$ ranges over the training set,
and the selection process aimed at maximizing the minimum
margin, as in Boosting [22]. However, it turned out
experimentally that the minimum margin alone is too coarse a
criterion, leading to many ties. Thus, a finer-grained criterion,
based on the margin histogram, has finally been defined.

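Let $c(\mathcal{C}, m)$ denote the number of training examples with
margin $m$ after $\mathcal{C}$. An order relation on classifier ensembles
$\mathcal{C}$ and $\mathcal{C}'$ can then be defined by comparing $c(\mathcal{C}, m)$ and
$c(\mathcal{C}', m)$ for increasing values of $m$; the better ensemble is
the one with fewer examples at the smallest margin on which
the two histograms differ.

$$\mathcal{C} \prec \mathcal{C}' \iff \exists\, m_0 \text{ s.t. } \begin{cases} \forall m < m_0,\ c(\mathcal{C}, m) = c(\mathcal{C}', m) \\ c(\mathcal{C}, m_0) > c(\mathcal{C}', m_0) \end{cases} \qquad (5)$$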

The pseudo-code of the ensemble selection algorithm is
displayed in Figure 1. It starts with a classifier pool $\mathcal{H}$, a set
of training examples $\mathcal{E}$ and an initial set of classifiers $\mathcal{C}_0$. It
then iteratively moves all classifiers from $\mathcal{H}$ into $\mathcal{C}$, based on
the above order on ensembles. Ultimately, the ensemble with
lowest error rate on $\mathcal{E}$ in the ensemble sequence $\mathcal{C}_0 \ldots \mathcal{C}_{t-1}$
is selected.
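The greedy procedure can be sketched as follows. Several choices here are assumptions made for illustration: each classifier is represented by the tuple of labels it predicts on the training set, ties in the majority vote are broken by class order, and the margin-histogram order is implemented by comparing the sorted margin lists lexicographically (a larger sorted list means fewer examples at the first margin value where two ensembles differ). All function names are hypothetical.

```python
from collections import Counter


def margins(ensemble, labels, classes):
    """Margin of every training example under the ensemble's votes."""
    result = []
    for i, y in enumerate(labels):
        counts = Counter(predictions[i] for predictions in ensemble)
        wrong = max((counts[k] for k in classes if k != y), default=0)
        result.append(counts[y] - wrong)
    return result


def error_rate(ensemble, labels, classes):
    """Training error of the majority vote (ties broken by class order)."""
    errors = 0
    for i, y in enumerate(labels):
        counts = Counter(predictions[i] for predictions in ensemble)
        vote = max(classes, key=lambda k: counts[k])
        errors += vote != y
    return errors / len(labels)


def ensemble_selection(pool, labels, classes, initial):
    """Greedy selection in the spirit of the pseudo-code of Figure 1."""
    remaining = list(dict.fromkeys(pool))  # drop duplicates, keep order
    current = list(initial)
    candidates = [list(current)]
    while remaining:
        # Best addition: the one whose sorted margins are lexicographically
        # largest, i.e. with the fewest examples at the smallest margins.
        best = max(
            remaining,
            key=lambda h: sorted(margins(current + [h], labels, classes)),
        )
        remaining.remove(best)
        current.append(best)
        candidates.append(list(current))
    # Lowest training error first, smallest ensemble in case of ties.
    return min(candidates, key=lambda c: (error_rate(c, labels, classes), len(c)))


if __name__ == "__main__":
    labels = (0, 0, 1, 1)
    a, b, c = (0, 0, 1, 0), (0, 1, 1, 1), (1, 0, 1, 1)
    chosen = ensemble_selection([a, b, c, a], labels, [0, 1], initial=[a])
    print(chosen, error_rate(chosen, labels, [0, 1]))
```

In the toy run, each of the three classifiers errs on a different example, so the three-member ensemble votes correctly everywhere and is preferred over the longer sequence.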



3.3 Offline Evolutionary Ensemble Learning 

Off-EEL is a two-step process. It first runs a standard
evolutionary learning algorithm. The approach does not
make any requirement on the genetic search space, that is
the classifier space; the designer can run Off-EEL on top
of her favorite evolutionary learning algorithm, searching for
linear classifiers, neural nets, rule systems, or genetic
programs. The only required modification concerns the fitness
function, which is set to the diversity-enhancing fitness
described in Section 3.1, taking the whole current population
as set of reference classifiers. In contrast with Boosting, the
process does not maintain any memory about the examples;
their weights are recomputed from scratch at each generation.
While Boosting might result in exponentially increasing
the weight of hard or possibly noisy examples, Off-EEL
thus keeps the weight of each training example bounded,
and thereby avoids the instability due to the data noise.

The second step achieves the ensemble selection based on
the margin-based criterion (Section 3.2 and Figure 1). It
uses the final population as pool of classifiers $\mathcal{H}$, and
initializes the classifier ensemble to the classifier $h^*$ that has
the smallest error rate on the training set in the population
($\mathcal{C} = \{h^*\}$).

3.4 Online Evolutionary Ensemble Learning 

In contrast with Off-EEL, On-EEL interleaves evolution- 
ary learning and ensemble selection; at each generation the 
classifier ensemble is updated using the current population. 

At generation 1, the classifier ensemble is initialized to the
classifier that minimizes the error rate on the training set.
In further generations, the current population is evolved
using the diversity-enhancing fitness function with the current
ensemble as reference set (Section 3.1), and the ensemble
selection algorithm (Figure 1) is launched, using the current
population as classifier pool $\mathcal{H}$, and the current classifier
ensemble as $\mathcal{C}_0$. The pseudo-code of On-EEL is given in Figure 2.

Notably, Off-EEL and On-EEL achieve different Exploration
vs Exploitation trade-offs. In Off-EEL, the set of reference
classifiers is the current population; the fitness function
thus favors both accurate and diverse classifiers in each
generation. The ensemble selection algorithm is launched 
only once, on a high quality and diversified pool of classi- 
fiers. 

Figure 2: Pseudo-code of On-EEL (training set $\mathcal{E}$).

1. Let $\mathcal{P}_1$ be the first evolutionary population, and $h^*$ the
   classifier with minimal error rate on $\mathcal{E}$.

2. $\mathcal{C}_1$ = Ensemble-Selection($\mathcal{P}_1$, $\mathcal{E}$, $\{h^*\}$)

3. For $t = 2 \ldots T$:

   (a) Evolve $\mathcal{P}_{t-1} \rightarrow \mathcal{P}_t$, using $\mathcal{C}_{t-1}$ as reference set.

   (b) $\mathcal{C}_t$ = Ensemble-Selection($\mathcal{P}_t$, $\mathcal{E}$, $\mathcal{C}_{t-1}$)

4. Return $\mathcal{C}_T$.

In On-EEL, the set of reference classifiers is the current
classifier ensemble; as in Boosting, the goal is to find
classifiers which overcome the errors of the past classifiers.
While the ensemble selection algorithm is launched at every
generation, it uses the biased current population as classifier
pool. In fact, On-EEL addresses a dynamic optimization
problem; if the classifier ensemble significantly changes
between one generation and the next, the fitness landscape
will change accordingly and several evolutionary generations
might be needed to accommodate this change. On the other
hand, as long as the current population does not perform
well, the ensemble selection algorithm is unlikely to select
further classifiers into the current ensemble; the fitness
landscape thus remains stable. The population diversity does
not directly result from the fitness function as in the
Off-EEL case; rather, it relates to the dynamic aspects of the
fitness function.
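The control flow of On-EEL can be summarized by a short skeleton. It is only a sketch: the evolutionary step, the ensemble selection and the error measure are passed in as callables, and their names and signatures (`evolve`, `select`, `error_rate`) are assumptions introduced for the illustration, not part of the original description.

```python
def on_eel(first_population, evolve, select, error_rate, generations):
    """Skeleton of the online loop: the ensemble is updated every generation.

    evolve(population, reference) -> next population, with fitness computed
        against the reference classifiers (here, the current ensemble).
    select(pool, initial_ensemble) -> new ensemble (greedy selection).
    error_rate(classifier) -> training error of a single classifier.
    """
    population = list(first_population)
    best = min(population, key=error_rate)
    ensemble = select(population, [best])
    for _ in range(2, generations + 1):
        # The current ensemble drives the fitness of the next generation.
        population = evolve(population, ensemble)
        # The new population is the pool; the old ensemble is the seed.
        ensemble = select(population, ensemble)
    return ensemble


if __name__ == "__main__":
    # Stub components, only meant to exercise the control flow.
    result = on_eel(
        first_population=[0, 1, 2],
        evolve=lambda population, reference: [p + 1 for p in population],
        select=lambda pool, initial: list(initial) + [max(pool)],
        error_rate=abs,
        generations=3,
    )
    print(result)
```

The skeleton makes the coupling visible: the ensemble is both the output of the selection step and the input that shapes the next generation's fitness, which is what turns the search into a dynamic optimization problem.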



4. EXPERIMENTAL SETTING 

This section describes the experimental setting used to 
assess the EEL framework. 

4.1 Datasets 

Experiments are conducted on the six UCI datasets [19] 
presented in Table 1. The performance of each algorithm is
measured by a standard stratified 10-fold cross-validation
procedure. The dataset is partitioned into 10 folds with the
same class distribution. Iteratively, all folds but the i-th
one are used to train a classifier, and the error rate of this 
classifier on the remaining i-th fold is recorded. The per- 
formance of the algorithm is averaged over 10 runs for each 
fold, and over the 10 folds. 
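A stratified partition of this kind can be sketched as follows. The round-robin assignment, the seed handling and the function name `stratified_folds` are assumptions chosen for brevity; the paper does not specify how its folds were built.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Split example indices into folds with near-identical class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the examples of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


if __name__ == "__main__":
    labels = [0] * 65 + [1] * 35  # e.g. a 65 % / 35 % class distribution
    folds = stratified_folds(labels)
    for i, test_fold in enumerate(folds):
        held_out = set(test_fold)
        train = [j for j in range(len(labels)) if j not in held_out]
        print(i, len(train), len(test_fold))
```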

4.2 Classifier Search Space 

As mentioned earlier on, evolutionary ensemble learning
can accommodate any type of classifier; Off-EEL and
On-EEL could consider neural nets, genetic programs or decision
lists as genotypic search space. Our experiments will
consider the most straightforward classifiers, namely separating
hyperplanes, as these can easily be inspected and compared.
Formally, let $X = \mathbb{R}^d$ be the instance space; a separating
hyperplane classifier $h$ is characterized as $(w, b) \in \mathbb{R}^d \times \mathbb{R}$
with $h(x) = \langle w, x \rangle - b$ ($\langle w, x \rangle$ denotes the scalar
product of $w$ and $x$). The search for a separating
hyperplane is amenable to quadratic optimization, with:

$$F(h) = \sum_{i = 1 \ldots n} (h(x_i) - y_i)^2 \qquad (6)$$

As the above optimization problem can be tackled using 
standard optimization algorithms, it provides a well-founded 
baseline for comparison. Specifically, the first goal of the 
experiments is thus to assess the merits of evolutionary en- 
semble learning against three other approaches. 

The first baseline algorithm, referred to as Least Mean
Square (LMS), uses a stochastic gradient algorithm to
determine the optimal separating hyperplane in the sense of
the criterion given by Equation 6 (see pseudo-code in Figure 3).
The second baseline algorithm is an elementary evolution- 
ary algorithm, producing the best-of-run separating hyper- 
plane such that it minimizes the (training) error rate 3 . 

The third reference algorithm is the prototypical
ensemble learning algorithm, namely AdaBoost with its default
parameters [8]. AdaBoost uses simple decision stumps [23]
as weak learner (more on this below).

The learning error is classically viewed as composed of
a variance term and a bias term [1]. The bias term
measures how far the target concept $tc$ is from the classifier
search space $\mathcal{H}$, that is, from the best classifier $h^*$ in this
search space. The variance term measures how far away one
can wander from $h^*$, wrongly selecting other classifiers in $\mathcal{H}$
(overfitting).

The comparison of the first and second baseline algorithms 
gives some insight into the intrinsic difficulty of the problem. 
Stochastic gradient (LMS) will find the global optimum for
the criterion given by Equation 6, but this solution optimizes at
best the training error. The comparison between the solutions
respectively found by LMS and the simple evolutionary
algorithm will thus reflect the learning variance term.

Similarly, the comparison of the first baseline algorithm 
and AdaBoost gives some insight into how the ensemble im- 
proves on the base weak learner; this improvement can be 
interpreted in terms of variance as well as in terms of bias 
(since the majority vote of decision stumps allows for de- 
scribing more complex regions than simple separating hy- 
perplanes alone). 
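A decision stump, the weak learner used with AdaBoost in these experiments, thresholds a single feature. A minimal sketch of its exhaustive training is given below; it is unweighted, whereas a stump trained inside AdaBoost would use weighted examples. The tuple representation of a stump and the names `train_stump` and `predict_stump` are assumptions, and at least two distinct labels are expected in the training data.

```python
from itertools import permutations


def train_stump(instances, labels):
    """Exhaustive search over feature, threshold and comparison direction."""
    classes = sorted(set(labels))
    best, best_correct = None, -1
    for feature in range(len(instances[0])):
        # Candidate thresholds: the values observed for this feature.
        for threshold in sorted({x[feature] for x in instances}):
            # (below, above) enumerates both comparison directions.
            for below, above in permutations(classes, 2):
                correct = sum(
                    (below if x[feature] < threshold else above) == y
                    for x, y in zip(instances, labels)
                )
                if correct > best_correct:
                    best = (feature, threshold, below, above)
                    best_correct = correct
    return best


def predict_stump(stump, x):
    """Class predicted by a stump (feature, threshold, below, above)."""
    feature, threshold, below, above = stump
    return below if x[feature] < threshold else above


if __name__ == "__main__":
    instances = [(0.0, 5.0), (1.0, 3.0), (2.0, 4.0), (3.0, 1.0)]
    labels = [0, 0, 1, 1]
    stump = train_stump(instances, labels)
    print(stump, [predict_stump(stump, x) for x in instances])
```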

4.3 Experimental Setting 

The parameters for the LMS algorithm (see Figure 3) are
as follows: the training rate, set to $\eta(t) = 1/(n\sqrt{t})$, decreases
over the training epochs; the maximum number of epochs
allowed is $T = 10000$; the algorithm stops when the
difference between the error rates of two consecutive epochs is
less than some threshold $\epsilon$ ($\epsilon = 10^{-7}$). Importantly, LMS
requires a preliminary normalization of the dataset (e.g.
$\forall i = 1 \ldots n$, $x_i \in [-1, 1]^d$). The final result is the error on
the test set, averaged over 10 runs for each fold (because of
the stochastic reordering of the training set) and averaged
over 10 folds.
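The LMS baseline can be sketched along these lines. Three points are assumptions rather than statements from the paper: the weight updates are the standard stochastic gradient step on the squared error, labels are encoded as -1 and +1, and the class is read from the sign of $h(x)$. The names `lms_train` and `lms_predict` are hypothetical.

```python
import math
import random


def lms_predict(w, b, x):
    """Sign of <w, x> - b, with labels encoded as -1 / +1 (assumed)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) - b > 0 else -1


def lms_train(instances, labels, max_epochs=10000, tolerance=1e-7, seed=0):
    """Stochastic gradient descent on the squared error of a hyperplane."""
    rng = random.Random(seed)
    n, d = len(instances), len(instances[0])
    w, b = [0.0] * d, 0.0
    order = list(range(n))
    previous_error = None
    for t in range(1, max_epochs + 1):
        eta = 1.0 / (n * math.sqrt(t))  # decreasing training rate
        rng.shuffle(order)
        for i in order:
            x, y = instances[i], labels[i]
            output = sum(wj * xj for wj, xj in zip(w, x)) - b
            delta = 2.0 * eta * (output - y)
            # Assumed update: gradient step on (output - y) ** 2.
            w = [wj - delta * xj for wj, xj in zip(w, x)]
            b += delta
        error = sum(
            lms_predict(w, b, x) != y for x, y in zip(instances, labels)
        ) / n
        # Stop when the training error rate no longer changes.
        if previous_error is not None and abs(previous_error - error) < tolerance:
            break
        previous_error = error
    return w, b


if __name__ == "__main__":
    w, b = lms_train([(-1.0,), (1.0,)], [-1, 1])
    print(w, b, lms_predict(w, b, (0.8,)))
```

Because the error rate is a multiple of $1/n$, a threshold as small as $10^{-7}$ amounts to stopping as soon as two consecutive epochs give the same training error.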

The classical AdaBoost algorithm [8] uses simple decision
stumps [23], and the number of Boosting iterations is limited
to 2000. Decision stumps are simple binary classifiers that
classify data according to a threshold value on one of the
features of the data set. If the feature value of a given data
point is less (or greater) than the threshold, the data point
is assigned to a given class, otherwise it is assigned to the
other class. Decision stumps are trained deterministically, by
looping over all features and all feature thresholds for a
given training dataset, selecting the feature, threshold, and
comparison operation on the threshold (> or <) that maximize
the classification accuracy on the training data set. Decision
stumps are the simplest possible linear classifiers, but
generate good results in combination with AdaBoost.

The elementary evolutionary algorithm is a real-valued
generational GA using SBX crossover, Gaussian mutations,
and tournament selection. The search space is $\mathbb{R}^{d+1}$ for
binary classification problems, and $\mathbb{R}^{2d+2}$ for ternary
classification problems, where $d$ is the number of attributes in
the problem domain. The evolutionary parameters are detailed
in Table 2. All experiments with the real-valued GA rely on
the C++ framework Open BEAGLE [9] [10].

3 For 3-class problems, e.g. bos or cmc, the classifier is
characterized as two hyperplanes, respectively separating
class 0 (resp. class 1) from the other two classes. In case of
conflict (the example is simultaneously classified in class 0
by the first classifier and in class 1 by the second classifier),
the tie is broken arbitrarily.

Table 1: UCI datasets used for the experimentations.

Dataset  Size  # features  # classes  Application domain
bcw       683       9          2      Wisconsin's breast cancer, 65 % benign and 35 % malignant.
bld       345       6          2      BUPA liver disorders, 58 % with disorders and 42 % without disorder.
bos       508      13          3      Boston housing, 34 % with median value v < 18.77 K$, 33 % with v in ]18.77, 23.74],
                                      and 33 % with v > 23.74.
cmc      1473       9          3      Contraceptive method choice, 43 % not using contraception, 35 % using short-term
                                      contraception, and 23 % using long-term contraception.
pid       768       8          2      Pima indians diabetes, 65 % tested negative and 35 % tested positive for diabetes.
spa      4601      57          2      Junk e-mail classification, 61 % tested non-junk and 39 % tested junk.

Figure 3: Least-mean square training algorithm.

1. Initialize $w = 0$ and $b = 0$

2. For $t = 1 \ldots T$:

   (a) Shuffle the dataset $\mathcal{E} = \{(x_i, y_i),\ i = 1 \ldots n\}$

   (b) For $i = 1 \ldots n$:

       $o_i = \langle w, x_i \rangle - b$

       $\Delta_i = 2\eta(t)(o_i - y_i)$

Table 2: Parameters for the real-valued GA.



Parameter                     Value
Population size               500
Termination criteria          100000 fitness evaluations
Tournament size               2
Initialization range          [-1, 1]
SBX crossover prob.           0.3
SBX crossover n-value         n = 2
Gaussian mutation prob.       0.1
Gaussian mutation std. dev.   σ = 0.05

5. RESULTS

This section reports on the experimental results obtained
by Off-EEL and On-EEL, compared to the three baseline
methods respectively noted LMS (optimal linear classifier),
GA (genetically evolved linear classifier) and Boosting
(ensemble of decision stumps), on the six UCI data sets
described in Table 1. For each method and problem, the
average test error (over 100 independent runs as described in
Section 4) and the associated standard deviation are displayed
in Table 3. The average computational effort of Off-EEL for
a run ranges from 30 seconds (on problem bld) to 20 minutes
(on problem spa), on AMD Athlon 1800+ computers
with 1 GB of memory. For On-EEL, the average computational
effort for a run ranges from 2 hours (on problem pid)
to 24 hours (on problem spa), on the same computers.

With respect to the baseline algorithms, a first remark is
that the LMS-based classifier is significantly outperformed
by all other methods on all problems but one (pid). This is
explained by the fact that the criterion given by Equation 6
needlessly over-constrains the learning problem, replacing a
set of linear inequalities with the minimization of a sum of
quadratic terms. Similarly, single-hypothesis evolutionary
learning is dominated by all other methods on all problems but
one (bcw). Boosting shows its acknowledged efficiency, as it is
the best algorithm on two out of six problems (Off-EEL and
Boosting are both best performers on the cmc problem).

Off-EEL is the best method on three out of the six problems
tested. Compared to AdaBoost, it generates ensembles with a
lower test error rate on four problems, with a tie on the cmc
problem and AdaBoost being the best on the spa problem. In
all cases, the number of classifiers is lower, with an average
between 235 and 335 classifiers for Off-EEL compared
with more than 750 on all problems but bcw for Boosting.
This is understandable given that the Off-EEL ensembles are
built starting from a population of 500 individuals. This
raises the question of whether the evolutionary learning
accuracy could be improved by considering larger population
sizes. But it should not be forgotten that the decision stumps
making up the AdaBoost ensembles are significantly simpler
than the evolved linear discriminants of Off-EEL. No clear
conclusion can thus be drawn on the relative complexity of the
ensembles generated by Off-EEL compared to Boosting.

Table 3: Results on the UCI datasets based on 10-fold cross-validation, using 10 independent runs over each
fold. Values are averages (standard deviations) over the 100 runs. Statistical tests are p-values of paired
t-tests on the test error rate compared to that of the best method on the dataset (the method with no
p-value entry). A "-" marks an empty cell.

Dataset  Measure             LMS            GA             Boosting         Off-EEL        On-EEL
bcw      Train error         3.9% (0.2%)    1.8% (0.2%)    0.0% (0.0%)      1.4% (0.2%)    0.4% (0.4%)
         Test error          4.0% (1.6%)    3.2% (1.7%)    5.3% (2.0%)      3.4% (1.7%)    3.5% (2.0%)
         Test error p-value  0.00           -              0.00             0.09           0.04
         Ensemble size       -              -              291.6 (68.2)     235.6 (66.8)   116.3 (278.2)
bid      Train error         29.8% (0.9%)   25.4% (1.2%)   0.0% (0.0%)      20.9% (1.5%)   18.9% (2.0%)
         Test error          30.4% (6.6%)   32.7% (6.6%)   30.4% (5.4%)     29.2% (7.4%)   29.5% (8.4%)
         Test error p-value  0.04           0.00           0.14             -              0.64
         Ensemble size       -              -              1081.4 (166.1)   301.0 (37.9)   294.1 (154.2)
bos      Train error         32.2% (1.3%)   23.4% (4.1%)   0.0% (0.0%)      16.7% (1.9%)   20.9% (2.3%)
         Test error          34.0% (6.7%)   30.7% (7.5%)   26.9% (4.2%)     22.7% (5.7%)   26.2% (7.2%)
         Test error p-value  0.00           0.00           0.00             -              0.00
         Ensemble size       -              -              761.1 (40.8)     303.8 (41.4)   2960.9 (2109.3)
cmc      Train error         51.6% (0.4%)   45.7% (1.4%)   43.3% (0.7%)     42.9% (1.2%)   43.9% (1.4%)
         Test error          51.8% (2.5%)   50.4% (3.9%)   46.8% (2.9%)     46.8% (3.9%)   47.7% (3.9%)
         Test error p-value  0.00           0.00           0.99             -              0.04
         Ensemble size       -              -              4000.0 (0.0)     326.4 (35.7)   2707.7 (1696.1)
pid      Train error         22.0% (0.6%)   20.2% (0.7%)   0.6% (0.5%)      19.8% (0.7%)   20.0% (0.8%)
         Test error          22.8% (3.5%)   24.2% (3.9%)   28.1% (5.0%)     24.0% (4.0%)   24.0% (3.9%)
         Test error p-value  -              0.00           0.00             0.00           0.00
         Ensemble size       -              -              1978.1 (43.0)    309.5 (37.6)   1196.3 (765.7)
spa      Train error         11.1% (0.4%)   7.9% (0.5%)    1.4% (0.1%)      6.1% (0.2%)    7.6% (0.8%)
         Test error          11.3% (1.2%)   9.0% (1.3%)    5.7% (0.8%)      6.7% (1.2%)    8.3% (1.4%)
         Test error p-value  0.00           0.00           -                0.00           0.00
         Ensemble size       -              -              2000.0 (0.0)     331.1 (28.4)   6890.0 (2938.1)

Despite its larger ensemble size, On-EEL is dominated by
Off-EEL on all problems but pid, where both approaches
generate identical test error rates. A tentative explanation
stems from the nature of the two approaches. Off-EEL is a
clear algorithm organized in two stages: classifier evolution
with a diversity-enhancing fitness, followed by ensemble
construction. On-EEL is more complex, alternating ensemble
construction and classifier evolution, with the
diversity-enforcing measure taken relative to the current
ensemble. The dynamics of On-EEL are hard to understand, but
it can be speculated that the iterative construction of the
ensemble (without individual removal) is prone to getting
stuck in local optima. Indeed, the "construction path" taken to
build the ensemble begins with a selection of some (presumably
poor) individuals at the beginning of the evolution. As these
individuals cannot be removed from the ensemble, they
significantly influence the choice of other individuals, biasing
and possibly misleading the whole process.
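
The path dependence speculated above can be illustrated on a toy case. The sketch below is not On-EEL. It assumes a fixed pool of classifiers, uses the training accuracy of an unweighted majority vote as a stand-in for the margin-based selection criterion, and stops at the first step that brings no improvement. The names vote_accuracy and greedy_ensemble are hypothetical.

```python
from typing import List, Tuple


def vote_accuracy(members: List[List[int]], y: List[int]) -> float:
    """Training accuracy of the unweighted majority vote (a tied vote counts as an error)."""
    correct = 0
    for i, label in enumerate(y):
        total = sum(m[i] for m in members)
        vote = 1 if total > 0 else -1 if total < 0 else 0
        correct += vote == label
    return correct / len(y)


def greedy_ensemble(pool: List[List[int]], y: List[int]) -> Tuple[List[int], float]:
    """Add one classifier at a time; members are never removed once selected."""
    ensemble: List[List[int]] = []
    chosen: List[int] = []
    best = 0.0
    while True:
        gains = [(vote_accuracy(ensemble + [p], y), k)
                 for k, p in enumerate(pool) if k not in chosen]
        if not gains:
            break
        acc, k = max(gains)  # ties are broken in favour of the larger index
        if acc <= best:      # stop as soon as no candidate improves the vote
            break
        ensemble.append(pool[k])
        chosen.append(k)
        best = acc
    return chosen, best


# Each row holds the predictions of one classifier on four training examples.
y = [1, 1, -1, -1]
pool = [[1, 1, 1, -1], [1, -1, -1, -1], [-1, 1, -1, -1]]
print(greedy_ensemble(pool, y))  # ([2], 0.75): the greedy path stops after one member
print(vote_accuracy(pool, y))    # 1.0: the full pool votes perfectly on this toy set
```

On this toy pool every second member creates tied votes, so the incremental construction halts at 75% training accuracy even though the three classifiers together make no error.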

6. DISCUSSION AND PERSPECTIVES 

This paper has examined the "Evolutionary Ensemble
Learning for Free" claim, based on the fact that, since
Evolutionary Algorithms maintain a population of solutions, it
comes naturally to use these populations as a pool for building
classifier ensembles.

Two main issues have been studied, respectively concerned 
with enforcing the diversity of the population of classifiers, 
and with selecting the classifiers either in the final popula- 
tion or along evolution. 

The use of a co-evolution-inspired fitness function,
generalizing [18], was found sufficient to generate diverse
classifiers. As already noted, there is a great similarity between
the co-evolution of programs and fitness cases [13] and the
Boosting principles [8]; the common idea is that good
classifiers are learned from good examples, while good examples
are generalized by good classifiers. The difference between
Boosting and co-evolution is that in Boosting, the training
examples are not evolved; instead, their weights are updated.
However, the uncontrolled growth of some weights,
typically in the case of noisy examples, actually appears as
the Achilles' heel of Boosting compared to Bagging. Basically,
AdaBoost can be viewed as a dynamic system [22];
the possible instability or periodicity of this dynamic system
has undesired consequences on the ensemble learning
performance. The use of co-evolutionary ideas, even though
the set of examples does not evolve, seems to increase the
stability of the learning process.
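
The weight growth mentioned above follows directly from the standard AdaBoost reweighting rule, in which the weight of example i is multiplied by exp(-alpha * y_i * h(x_i)) and then renormalized, with alpha = 0.5 * ln((1 - eps) / eps) and eps the weighted error of hypothesis h. The sketch below applies one such step to a toy case; the function name adaboost_round and the data are illustrative assumptions.

```python
import math
from typing import List, Tuple


def adaboost_round(weights: List[float], y: List[int],
                   h: List[int]) -> Tuple[List[float], float]:
    """One AdaBoost reweighting step for predictions h on labels y in {-1, +1}.

    Requires a weighted error strictly between 0 and 1.
    """
    eps = sum(w for w, yi, hi in zip(weights, y, h) if yi != hi)
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    new = [w * math.exp(-alpha * yi * hi) for w, yi, hi in zip(weights, y, h)]
    z = sum(new)  # normalization constant, so that the weights sum to 1
    return [w / z for w in new], alpha


# Ten examples with uniform weights; only example 0 (say, a noisy label) is misclassified.
y = [-1, 1, 1, 1, 1, 1, -1, -1, -1, -1]
h = [1, 1, 1, 1, 1, 1, -1, -1, -1, -1]
weights, alpha = adaboost_round([0.1] * 10, y, h)
print(round(weights[0], 3), round(weights[1], 3))  # 0.5 0.056
```

After a single round the misclassified example carries half of the total weight. If its label is noisy and later hypotheses keep missing it, the same multiplicative update keeps pushing its weight up, which is the behaviour referred to above.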

The two EEL frameworks investigated in this paper can
be considered promising. Off-EEL constructs the ensembles
with the best performance while needing few modifications
to a traditional evolutionary algorithm: a diversity-enhancing
fitness and the construction of an ensemble from
the final population. But the size of the ensembles generated
suggests that a bigger population would lead to bigger
and possibly better ensembles. For the sake of scalability,
this suggests that the ensemble should be gradually
constructed along evolution, instead of considering only the
final population. This has been explored with On-EEL, with
lower performance compared to Off-EEL. It is suggested
that ensemble construction with On-EEL is prone to getting
stuck in local minima, so some capability of removing
individuals could be beneficial, at the risk of inducing a highly
dynamic algorithm. Ultimately, the momentum and dynamics
of EEL should be controlled by evolution itself, enforcing
some trade-off between exploring new regions and preserving
efficient optimization. This will be the subject of future
research.